1 Introduction to R and Data Visualization1

1.1 Basic Commands

R uses functions to perform operations. To run a function called funcname, we type funcname(input1, input2), where the inputs (or arguments) input1 argument and input2 tell R how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function c() (for concatenate). Any numbers inside the parentheses are joined together. The following command instructs R to join together the numbers 1, 3, 2, and 5, and to save them as a vector named x. When we type x, it gives us back the vector.

x <- c(1, 3, 2, 5) 
x
[1] 1 3 2 5

Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:

x = c(1, 6, 2)
x
[1] 1 6 2
y = c(1, 4, 3)
y
[1] 1 4 3

Hitting the up arrow multiple times will display the previous commands, which can then be edited. This is useful since one often wishes to repeat a similar command. In addition, typing ?funcname will always cause R to open a new help file window with additional information about the function funcname.

We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length. We can check their length using the length() function.

length(x)
[1] 3
length(y)
[1] 3
x + y
[1]  2 10  5

The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far. The rm() function can be used to delete any that we don’t want.

ls()
[1] "x" "y"
rm(x, y)
ls()
character(0)

It’s also possible to remove all objects at once:

rm(list = ls())

The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:

?matrix

The help file reveals that the matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix.

x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Note that we could just as well omit typing data =, nrow =, and ncol = in the matrix() command above: that is, we could just type

x <- matrix(c(1, 2, 3, 4), 2, 2)

and this would have the same effect. However, it can sometimes be useful to specify the names of the arguments passed in, since otherwise R will assume that the function arguments are passed into the function in the same order that is given in the function’s help file. As this example illustrates, by default R creates matrices by successively filling in columns. Alternatively, the byrow = TRUE option can be used to populate the matrix in order of the rows.

matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Notice that in the above command we did not assign the matrix to a value such as x. In this case the matrix is printed to the screen but is not saved for future calculations. The sqrt() function returns the square root of each element of a vector or matrix. The command x^2 raises each element of x to the power 2; any powers are possible, including fractional or negative powers.

sqrt(x)
         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000
x^2
     [,1] [,2]
[1,]    1    9
[2,]    4   16

The rnorm() function generates a vector of random normal variables, with first argument n the sample size. Each time we call this function, we will get a different answer. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them.

x = rnorm(50)
y = x + rnorm(50, mean = 50, sd = 0.1)
cor(x, y)
[1] 0.9938363

By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. The set.seed() function takes an (arbitrary) integer argument.

set.seed(1303)
rnorm(50)
 [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
 [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
[11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
[16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
[21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
[26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
[31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
[36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
[41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
[46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557

We use set.seed() throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However, it should be noted that as new versions of R become available it is possible that some small discrepancies may form between the book and the output from R.

The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function.

set.seed(3)
y <- rnorm(100)
mean(y)
[1] 0.01103557
var(y)
[1] 0.7328675
sqrt(var(y))
[1] 0.8560768
sd(y)
[1] 0.8560768

1.2 Graphics

The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.

x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", 
     main = "Plot of X vs Y")

We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a jpeg, we use the jpeg() function, and to create a pdf, we use the pdf() function.

jpeg(file= "./JPG/YourFileName.jpeg")
plot(x, y, col = "green")
dev.off()
quartz_off_screen 
                2 

To display the saved file as shown in Figure 1.1, use the include_graphics() function from knitr.2

Using `knitr::include_graphics()`

Figure 1.1: Using knitr::include_graphics()

The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document.

The function seq() can be used to create a sequence of numbers. For instance, seq(a, b) makes a vector of integers between a and b. There are many other options: for instance, seq(0, 1, length = 10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3, 11) for integer arguments.

x <- seq(1, 10)
x
 [1]  1  2  3  4  5  6  7  8  9 10
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x = seq(-pi, pi, length = 50)

We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:

  1. A vector of the x values (the first dimension),
  2. A vector of the y values (the second dimension), and
  3. A matrix whose elements correspond to the z value (the third dimension) for each pair of (x, y) coordinates.

As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.

y <- x
f <- outer(x, y, function(x, y){
  cos(y) / (1 + x^2)
})
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)

fa <- (f -t(f))/2
contour(x, y, fa, nlevels = 15)

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts. Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is persp() viewed.

image(x, y, fa)

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

1.3 Indexing

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.

A <- matrix(1:16, 4, 4)
A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Then, typing

A[2, 3]
[1] 10

will select the element corresponding to the second row and the third column. The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices.

A[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
A[1:3, 2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
A[1:2, ]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
A[, 1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

The last two examples include either no index for the columns or no index for the rows. These indicate that R should include all columns or all rows, respectively. R treats a single row or column of a matrix as a vector.

A[1, ]
[1]  1  5  9 13

The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.

A[-c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16

The dim() function outputs the number of rows followed by the number of columns of a given matrix.

dim(A)
[1] 4 4

1.4 Loading Data

For most analyses, the first step involves importing a data set into R. The read.table() function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function write.table() to export data. Before attempting to load a data set, we must make sure that R knows to search for the data in the proper directory. For example on a Windows system one could select the directory using the Change dir. . . option under the File menu. However, the details of how to do this depend on the operating system (e.g. Windows, Mac, Unix) that is being used, and so we do not give further details here. We begin by loading in the Auto data set. This data is part of the ISLR2 library, discussed in Chapter 3. To illustrate the read.table() function we load it now from a text file, Auto.data, which can find on the textbook website. The following command will load the Auto.data file into R and store it as an object called Auto, in a format referred to as a data frame.

Auto <- read.table("Auto.data")
head(Auto) 
    V1        V2           V3         V4     V5           V6   V7     V8
1  mpg cylinders displacement horsepower weight acceleration year origin
2 18.0         8        307.0      130.0  3504.         12.0   70      1
3 15.0         8        350.0      165.0  3693.         11.5   70      1
4 18.0         8        318.0      150.0  3436.         11.0   70      1
5 16.0         8        304.0      150.0  3433.         12.0   70      1
6 17.0         8        302.0      140.0  3449.         10.5   70      1
                         V9
1                      name
2 chevrolet chevelle malibu
3         buick skylark 320
4        plymouth satellite
5             amc rebel sst
6               ford torino

Note that Auto.data is simply a text file, which you could alternatively open on your computer using a standard text editor. It is often a good idea to view a data set using a text editor or other software such as Excel before loading it into R. This particular data set has not been loaded correctly, because R has assumed that the variable names are part of the data and so has included them in the first row. The data set also includes a number of missing observations, indicated by a question mark ?. Missing values are a common occurrence in real data sets. Using the option header = TRUE in the read.table() function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.

Auto <- read.table("Auto.data", header = TRUE, sep = "", na.strings = "?", stringsAsFactors = TRUE)
head(Auto)  
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
library(DT)
datatable(Auto)
Auto[1:4, ]
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst

The dim() function tells us that the data has 397 observations, or rows, and 9 variables, or columns. There are various ways to deal with the missing data. In this case, only five of the rows contain missing observations, and so we choose to use the na.omit() function to simply remove these rows.

dim(Auto)
[1] 397   9
Auto2 <- na.omit(Auto)
dim(Auto2)
[1] 392   9

Once the data are loaded correctly, we can use names() to check the variable names.

names(Auto2)
[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
[6] "acceleration" "year"         "origin"       "name"        

1.5 Additional Graphical and Numerical Summaries

We can use the plot() function to produce scatterplots of the quantitative variables. However, simply typing the variable names will produce an error message, because R does not know to look in the Auto data set for those variables. To refer to a variable, we must type the data set and the variable name joined with a $ symbol.

plot(Auto2$cylinders, Auto2$mpg)

plot(mpg ~ cylinders, data = Auto2)

with(data = Auto2,
     plot(cylinders, mpg)
     )

The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. The as.factor() function converts quantitative variables into qualitative variables.

Auto2$cylinders <- as.factor(Auto2$cylinders)

If the variable plotted on the x-axis is categorial, then boxplots will automatically be produced by the plot() function. As usual, a number of options can be specified in order to customize the plots.

plot(Auto2$cylinders, Auto2$mpg)

plot(mpg ~ cylinders, data = Auto2)

plot(mpg ~ cylinders, data = Auto2, col = "red")

plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE)

plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE, 
     horizontal = TRUE)

plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE, 
     horizontal = TRUE, xlab = "cylinders", ylab = "MPG")

The hist() function can be used to plot a histogram. Note that col = 2 has the same effect as col = "red".

hist(Auto2$mpg, col = "red", xlab = "MPG", main = "Your Title Here")

hist(Auto2$mpg, col = "red", xlab = "MPG", main = "Your Title Here", breaks = 15)

1.6 Creating boxplots and histograms with ggplot2

See the geom_boxplot documentation and the geom_freqpoly documentation for more details.

library(ggplot2)
p <- ggplot(data = Auto2, aes(x = cylinders, y = mpg))
p +  geom_boxplot()

p +  geom_boxplot() + 
  coord_flip()

p +  geom_boxplot() + 
  coord_flip() + 
  theme_bw()

p +  geom_boxplot(fill = "red") + 
  coord_flip() + 
  theme_bw()

p +  geom_boxplot(fill = "red") + 
  coord_flip() + theme_bw() + 
  labs(x = "Cylinders", y = "MPG")

p +  geom_boxplot(fill = "red", varwidth = TRUE) + 
  coord_flip() + 
  theme_bw() + 
  labs(x = "Cylinders", y = "MPG")

p <- ggplot(data = Auto2, aes(x = mpg))
p + geom_histogram()

p + geom_histogram(binwidth = 5)

p + geom_histogram(binwidth = 5, fill = "blue")

p + geom_histogram(binwidth = 5, fill = "blue", color = "black")

p + geom_histogram(binwidth = 5, fill = "blue", color = "black") + 
  theme_bw()

p + geom_histogram(binwidth = 5, fill = "blue", 
                   color = "black", aes(y = ..density..)) + 
  theme_bw()

1.6.1 Creating boxplots and histograms with ggvis

library(ggvis)
Auto2 %>% 
  ggvis(x = ~cylinders, y = ~mpg) %>% 
  layer_boxplots(fill := "red")
Auto2 %>% 
  ggvis(x = ~mpg) %>% 
  layer_histograms(fill := "lightblue", width = 1)
Auto2 %>% 
  ggvis(x = ~mpg) %>% 
  layer_histograms(fill := "pink", width = 5) %>% 
  add_axis("x", title = "Miles Per Gallon")

1.6.2 Using plotly

library(plotly)
p1 <- ggplot(data = Auto2, aes(x = cylinders, y = mpg)) + 
  geom_boxplot(fill = "red", varwidth = TRUE) + 
  coord_flip() + 
  theme_bw() + 
  labs(x = "Cylinders", y = "MPG")
p2 <- ggplotly(p1)
p2
p3 <- ggplot(data = Auto2, aes(x = mpg)) + 
  geom_histogram(binwidth = 5, fill = "blue", color = "black") + 
  theme_bw()
p4 <- ggplotly(p3)
p4

The summary() function produces a numerical summary of each variable in a particular data set.

summary(Auto2)
      mpg        cylinders  displacement     horsepower        weight    
 Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   5:  3     Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   6: 83     Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   8:103     3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
                                                                         
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  

For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable.

summary(Auto2$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   17.00   22.75   23.45   29.00   46.60 

1.7 Aggregating with dplyr

Consider producing summary statistics for the variable mpg when it is grouped by cylinders.

library(dplyr)
Auto2 %>%
  group_by(cylinders) %>%
  summarize(median(mpg), IQR(mpg), n())
# A tibble: 5 × 4
  cylinders `median(mpg)` `IQR(mpg)` `n()`
  <fct>             <dbl>      <dbl> <int>
1 3                  20.2       3.3      4
2 4                  28.4       7.95   199
3 5                  25.4       8.05     3
4 6                  19         3       83
5 8                  14         3      103

2 Linear Regression

2.1 Libraries

The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR package, which includes the data sets associated with this book.

library(MASS)
library(ISLR)
library(DT)

If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR, must be downloaded the first time they are used. This can be done directly from within R. For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR"). This installation only needs to be done the first time you use a package. However, the library() function must be called each time you wish to use a given package.

2.2 Simple Linear Regression

The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).

names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   
datatable(Boston, rownames = FALSE)

To find out more about the data set, we can type ?Boston.

We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic lm() syntax is lm(y∼x,data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.

lm.fit <- lm(medv ~ lstat)
Error in eval(predvars, data, env): object 'medv' not found

The command causes an error because R does not know where to find the variables medv and lstat.

lm.fit <- lm(medv ~ lstat, data = Boston)

If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the \(R^2\) statistic and F-statistic for the model.

lm.fit

Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  
summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name—e.g. lm.fit$coefficients — it is safer to use the extractor functions like coef() to access them.

names(lm.fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
lm.fit$coefficients
(Intercept)       lstat 
 34.5538409  -0.9500494 
lm.fit[[1]]
(Intercept)       lstat 
 34.5538409  -0.9500494 
coef(lm.fit)
(Intercept)       lstat 
 34.5538409  -0.9500494 

In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.

confint(lm.fit, level = 0.95)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505

Consider constructing a confidence interval for \(\beta_1\) using the information provided from the summary of lm.fit:

summary(lm.fit)$coefficients
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 34.5538409 0.56262735  61.41515 3.743081e-236
lstat       -0.9500494 0.03873342 -24.52790  5.081103e-88
b1 <- summary(lm.fit)$coefficients[2, 1]
sb1 <- summary(lm.fit)$coefficients[2, 2]
ct <- qt(0.975, summary(lm.fit)$df[2])
CI <- b1 + c(-1, 1)*ct*sb1
CI
[1] -1.0261482 -0.8739505

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.

CI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)), 
              interval = "confidence")
CI
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
PI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)), 
              interval = "predict")
PI
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846

For instance, the 95% confidence interval associated with a lstat value of 10 is (24.474132, 25.6325627), and the 95 % prediction interval is (12.8276263, 37.2790683). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.0533473 for medv when lstat equals 10), but the latter are substantially wider.

We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.

plot(medv ~ lstat, data = Boston)
abline(lm.fit)

# Or using ggplot2
library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  theme_bw()

There is some evidence for non-linearity in the relationship between lstat and medv. We will explore this issue later in this lab.

The abline() function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a and slope b, we type abline(a, b). Below we experiment with some additional settings for plotting lines and points. The lwd = 3 command causes the width of the regression line to be increased by a factor of 3; this works for the plot() and lines() functions also. We can also use the pch option to create different plotting symbols.

plot(medv ~ lstat, data = Boston)
abline(lm.fit, lwd = 3)

plot(medv ~ lstat, data = Boston)
abline(lm.fit, lwd = 3, col = "red")

plot(medv ~ lstat, data = Boston, pch = 20)

plot(medv ~ lstat, data = Boston, pch = "+")

plot(1:20, 1:20, pch = 1:20)

2.2.1 Using ggplot2

ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  theme_bw()

# thicker line
ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", size = 2) +
  theme_bw()

# Different plotting shapes
ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point(shape = "+", size = 5) +
  geom_smooth(method = "lm", color = "red") +
  theme_bw()

ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point(shape = 23) +
  geom_smooth(method = "lm", color = "red") +
  theme_bw()

Next we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a 2 \(\times\) 2 grid of panels.

opar <- par(no.readonly = TRUE)  # copy of current settings
par(mfrow = c(2, 2))             # 2 * 2
plot(lm.fit)

par(opar)                        # restore original settings

Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.

opar <- par(no.readonly = TRUE)  # copy of current settings
par(mfrow = c(1, 2))
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))

par(opar)                        # restore original settings

I prefer to use the function residualPlots from the car package to evaluate residuals.

car::residualPlots(lm.fit)

           Test stat Pr(>|Test stat|)    
lstat         11.627        < 2.2e-16 ***
Tukey test    11.627        < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues function. The function influenceIndexPlot from the carpackage creates four diagnostic plots including a plot of the hav-values.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
375 
375 
car::influenceIndexPlot(lm.fit, id.n = 5)

The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

Recall that hat-values exceeding \((p + 1)/n = (1 + 1)/504 = 0.0039683\) should be evaluated for high leverage.

2.3 Multiple Linear Regression

In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y ∼ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.

ls.fit <- lm(medv ~ lstat + age, data = Boston)
summary(ls.fit)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

ls.fit <- lm(medv ~ ., data = Boston)
summary(ls.fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIF’s are low to moderate for this data. The car package is not part of the base R installation so it must be downloaded the first time you use it via the install.packages option in R.

library(car)
vif(ls.fit)
    crim       zn    indus     chas      nox       rm      age      dis 
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
     rad      tax  ptratio    black    lstat 
7.484496 9.008554 1.799084 1.348521 2.941491 

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.

ls.fit1 <- lm(medv ~ . - age, data = Boston)
summary(ls.fit1)

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16
# Or
ls.fit2 <-update(ls.fit, .~. - age)
summary(ls.fit2)

Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
    rad + tax + ptratio + black + lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

2.3.1 Interaction Terms

It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat\(\times\)age as predictors; it is a shorthand for lstat + age + lstat:age.

summary(lm(medv ~ lstat*age, data = Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

2.3.2 Non-Linear Transformations for the Predictors

The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning in a formula; wrapping as we do allows the standard usage in R, which is I() to raise \(X\) to the power 2. We now perform a regression of medv onto lstat and lstat\(^2\).

lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.

anova(lm.fit, lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here Model 1 (lm.fit) represents the linear submodel containing only one predictor, lstat, while Model 2 (lm.fit2) corresponds to the larger quadratic model that has two predictors, lstat and lstat\(^2\). The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135.1998221 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat\(^2\) is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. If we type

par(mfrow = c(2, 2))
plot(lm.fit2)
par(mfrow = c(1, 1))

then we see that when the lstat\(^2\) term is included in the model, there is little discernible pattern in the residuals.

In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:

lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16

This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.

library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) + 
  geom_point() + 
  theme_bw() + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 5))
Fifth order polynomial fit

Figure 2.1: Fifth order polynomial fit

Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

summary(lm(medv ~ log(rm), data = Boston))

Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -76.488      5.028  -15.21   <2e-16 ***
log(rm)       54.055      2.739   19.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16
library(broom)
library(dplyr)
lm(medv ~ log(rm), data = Boston) %>%
  tidy() %>%
  knitr::kable(caption = "Summary of the coefficients created using pipes")
Table 2.1: Summary of the coefficients created using pipes
term estimate std.error statistic p.value
(Intercept) -76.48782 5.027725 -15.21321 0
log(rm) 54.05457 2.739460 19.73183 0
ggplot(data = Boston, aes(x = log(rm), y = medv)) +
  geom_point() +
  theme_bw() + 
  stat_smooth(method = "lm")

2.3.3 Qualitative Predictors

We will now examine the Carseats data, which is part of the ISLR package. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

names(Carseats)
 [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
 [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
[11] "US"         

The Carseats data includes qualitative predictors such as Shelveloc, an indicator of the quality of the shelving location—–that is, the space within a store in which the car seat is displayed—–at each location. The predictor Shelveloc takes on three possible values, Bad, Medium, and Good. Given a qualitative variable such as Shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
Income              0.0108940  0.0026044   4.183 3.57e-05 ***
Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
Population          0.0001592  0.0003679   0.433 0.665330    
Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
Age                -0.0579466  0.0159506  -3.633 0.000318 ***
Education          -0.0208525  0.0196131  -1.063 0.288361    
UrbanYes            0.1401597  0.1124019   1.247 0.213171    
USYes              -0.1575571  0.1489234  -1.058 0.290729    
Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
Price:Age           0.0001068  0.0001333   0.801 0.423812    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8719 
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The contrasts() function returns the coding that R uses for the dummy variables.

contrasts(Carseats$ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1

Use ?contrasts to learn about other contrasts, and how to set them. R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.

3 Classification Methods

3.1 The Stock Market Data

We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 library. This data set consists of percentage returns for the S&P 500 stock index over 1, 250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.

library(ISLR2)
names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
[7] "Volume"    "Today"     "Direction"
dim(Smarket)
[1] 1250    9
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
           
           
           
           
pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.

cor(Smarket)
#Error in cor(Smarket) : ‘x’ must be numeric
cor(Smarket[, -9])
             Year         Lag1         Lag2         Lag3         Lag4
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
               Lag5      Volume        Today
Year    0.029787995  0.53900647  0.030095229
Lag1   -0.005674606  0.04090991 -0.026155045
Lag2   -0.003557949 -0.04338321 -0.010250033
Lag3   -0.018808338 -0.04182369 -0.002447647
Lag4   -0.027083641 -0.04841425 -0.006899527
Lag5    1.000000000 -0.02200231 -0.034860083
Volume -0.022002315  1.00000000  0.014591823
Today  -0.034860083  0.01459182  1.000000000

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data, which is ordered chronologically, we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.

attach(Smarket)
plot(Volume)

3.2 Logistic Regression

Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function can be used to fit many types of generalized linear models, including logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,data = Smarket, family = binomial)
summary(glm.fits)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.

coef(glm.fits)
 (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
      Volume 
 0.135440659 
summary(glm.fits)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(glm.fits)$coef[, 4]
(Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
  0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974 
     Volume 
  0.3924004 

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output probabilities of the form \(P(Y = 1|X)\), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

glm.probs <- predict(glm.fits , type = "response")
glm.probs[1:10]
        1         2         3         4         5         6         7         8 
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
        9        10 
0.5176135 0.4888378 
contrasts(Direction)
     Up
Down  0
Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

table(glm.pred , Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
(507 + 145) / 1250
[1] 0.5216
mean(glm.pred == Direction)
[1] 0.5216

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market \(52.2\%\) of the time.

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, \(100\% - 52.2\% = 47.8\%\), is the training error rate. As we have seen previously, the training error rate is often overly optimistic- it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
[1] 252   9
Direction.2005 <- Direction[!train]

The object train is a vector of 1,250 elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train, ] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train, ] yields a submatrix of the stock market data containing only the observations for which train is FALSE- that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set- that is, for the days in 2005.

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Smarket , family = binomial , subset = train)
glm.probs <- predict(glm.fits , Smarket.2005, type = "response")

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
mean(glm.pred == Direction.2005)
[1] 0.4801587
mean(glm.pred != Direction.2005)
[1] 0.5198413

The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is \(52\%\), which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming pvalues associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

glm.fits <- glm(Direction ~ Lag1 + Lag2 , data = Smarket , family = binomial , subset = train)
glm.probs <- predict(glm.fits , Smarket.2005,type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
mean(glm.pred == Direction.2005)
[1] 0.5595238
106 / (106 + 76)
[1] 0.5824176

Now the results appear to be a little better: \(56\%\) of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct \(56\%\) of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naive approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a \(58\%\) accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.

Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and -0.8. We do this using the predict() function.

predict(glm.fits, newdata = data.frame(Lag1 = c(1.2 , 1.5) , Lag2 = c(1.1 , -0.8)),type = "response")
        1         2 
0.4791462 0.4960939 

3.3 K-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other modelfitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.

  1. A matrix containing the predictors associated with the training data, labeled train.X below.

  2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.

  3. A vector containing the class labels for the training observations, labeled train.Direction below.

  4. A value for \(K\), the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

library(class)
train.X <- cbind(Lag1 , Lag2)[train , ]
test.X <- cbind(Lag1 , Lag2)[!train , ]
train.Direction <- Direction[train]

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

set.seed (1)
knn.pred <- knn(train.X, test.X, train.Direction , k = 1)
table(knn.pred , Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   43 58
    Up     68 83
(83 + 43) / 252
[1] 0.5

The results using \(K = 1\) are not very good, since only \(50\%\) of the observations are correctly predicted. Of course, it may be that \(K = 1\) results in an overly flexible fit to the data. Below, we repeat the analysis using \(K = 3\).

knn.pred <- knn(train.X, test.X, train.Direction , k = 3)
table(knn.pred , Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87
mean(knn.pred == Direction.2005)
[1] 0.5357143

The results have improved slightly. But increasing \(K\) further turns out to provide no further improvements.

KNN does not perform well on the Smarket data but it does often provide impressive results. As an example we will apply the KNN approach to the Caravan data set, which is part of the ISLR2 library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only \(6\%\) of people purchased caravan insurance.

dim(Caravan)
[1] 5822   86
attach(Caravan)
summary(Purchase)
  No  Yes 
5474  348 
348/5822
[1] 0.05977327

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of \(\$1,000\) in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect. This is contrary to our intuition that a salary difference of \(\$1,000\) is quite small compared to an age difference of 50 years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured salary in Japanese yen, or if we measured age in minutes, then we’d get quite different classification results from what we get if these two variables are measured in dollars and years.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.

standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])
[1] 165.0378
var(Caravan[, 2])
[1] 0.1647078
var(standardized.X[, 1])
[1] 1
var(standardized.X[, 2])
[1] 1

Now every column of standardized.X has a standard deviation of one and a mean of zero.

We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using \(K = 1\), and evaluate its performance on the test data.

test <- 1:1000
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed (1)
knn.pred <- knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)
[1] 0.118
mean(test.Y != "No")
[1] 0.059

The vector test is numeric, with values from 1 through 1, 000. Typing standardized.X[test, ] yields the submatrix of the data containing the observations whose indices range from 1 to 1, 000, whereas typing standardized.X[-test, ] yields the submatrix containing the observations whose indices do not range from 1 to 1, 000. The KNN error rate on the 1,000 test observations is just under \(12\%\). At first glance, this may appear to be fairly good. However, since only \(6\%\) of customers purchased insurance, we could get the error rate down to \(6\%\) by always predicting No regardless of the values of the predictors!

Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only \(6\%\), which may be far too low given the costs involved. Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.

It turns out that KNN with \(K = 1\) does far better than random guessing among the customers that are predicted to buy insurance. Among 77 such customers, 9, or \(11.7\%\), actually do purchase insurance. This is double the rate that one would obtain from random guessing.

table(knn.pred , test.Y)
        test.Y
knn.pred  No Yes
     No  873  50
     Yes  68   9
9 / (68 + 9)
[1] 0.1168831

Using \(K = 3\), the success rate increases to \(19\%\), and with \(K = 5\) the rate is \(26.7\%\). This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!

knn.pred <- knn(train.X, test.X, train.Y, k = 3)
table(knn.pred , test.Y)
        test.Y
knn.pred  No Yes
     No  920  54
     Yes  21   5
5 / 26
[1] 0.1923077
knn.pred <- knn(train.X, test.X, train.Y, k = 5)
table(knn.pred , test.Y)
        test.Y
knn.pred  No Yes
     No  930  55
     Yes  11   4
4 / 15
[1] 0.2666667

However, while this strategy is cost-effective, it is worth noting that only 15 customers are predicted to purchase insurance using KNN with \(K = 5\). In practice, the insurance company may wish to expend resources on convincing more than just 15 potential customers to buy insurance.

As a comparison, we can also fit a logistic regression model to the data. If we use 0.5 as the predicted probability cut-off for the classifier, then we have a problem: only seven of the test observations are predicted to purchase insurance. Even worse, we are wrong about all of these! However, we are not required to use a cut-off of 0.5. If we instead predict a purchase any time the predicted probability of purchase exceeds 0.25, we get much better results: we predict that 33 people will purchase insurance, and we are correct for about \(33\%\) of these people. This is over five times better than random guessing!

glm.fits <- glm(Purchase ~ ., data = Caravan ,family = binomial , subset = -test)
glm.probs <- predict(glm.fits, Caravan[test , ],type = "response")
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .5] <- "Yes"
table(glm.pred , test.Y)
        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .25] <- "Yes"
table(glm.pred , test.Y)
        test.Y
glm.pred  No Yes
     No  919  48
     Yes  22  11
11 / (22+11)
[1] 0.3333333

3.4 Poisson Regression

Finally, we fit a Poisson regression model to the Bikeshare data set, which measures the number of bike rentals (bikers) per hour in Washington, DC. The data can be found in the ISLR2 library.

attach(Bikeshare)
dim(Bikeshare)
[1] 8645   15
names(Bikeshare)
 [1] "season"     "mnth"       "day"        "hr"         "holiday"   
 [6] "weekday"    "workingday" "weathersit" "temp"       "atemp"     
[11] "hum"        "windspeed"  "casual"     "registered" "bikers"    

We begin by fitting a least squares linear regression model to the data.

mod.lm <- lm( bikers ~ mnth + hr + workingday + temp + weathersit , data = Bikeshare)
summary(mod.lm)

Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    data = Bikeshare)

Residuals:
    Min      1Q  Median      3Q     Max 
-299.00  -45.70   -6.23   41.08  425.29 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -68.632      5.307 -12.932  < 2e-16 ***
mnthFeb                      6.845      4.287   1.597 0.110398    
mnthMarch                   16.551      4.301   3.848 0.000120 ***
mnthApril                   41.425      4.972   8.331  < 2e-16 ***
mnthMay                     72.557      5.641  12.862  < 2e-16 ***
mnthJune                    67.819      6.544  10.364  < 2e-16 ***
mnthJuly                    45.324      7.081   6.401 1.63e-10 ***
mnthAug                     53.243      6.640   8.019 1.21e-15 ***
mnthSept                    66.678      5.925  11.254  < 2e-16 ***
mnthOct                     75.834      4.950  15.319  < 2e-16 ***
mnthNov                     60.310      4.610  13.083  < 2e-16 ***
mnthDec                     46.458      4.271  10.878  < 2e-16 ***
hr1                        -14.579      5.699  -2.558 0.010536 *  
hr2                        -21.579      5.733  -3.764 0.000168 ***
hr3                        -31.141      5.778  -5.389 7.26e-08 ***
hr4                        -36.908      5.802  -6.361 2.11e-10 ***
hr5                        -24.135      5.737  -4.207 2.61e-05 ***
hr6                         20.600      5.704   3.612 0.000306 ***
hr7                        120.093      5.693  21.095  < 2e-16 ***
hr8                        223.662      5.690  39.310  < 2e-16 ***
hr9                        120.582      5.693  21.182  < 2e-16 ***
hr10                        83.801      5.705  14.689  < 2e-16 ***
hr11                       105.423      5.722  18.424  < 2e-16 ***
hr12                       137.284      5.740  23.916  < 2e-16 ***
hr13                       136.036      5.760  23.617  < 2e-16 ***
hr14                       126.636      5.776  21.923  < 2e-16 ***
hr15                       132.087      5.780  22.852  < 2e-16 ***
hr16                       178.521      5.772  30.927  < 2e-16 ***
hr17                       296.267      5.749  51.537  < 2e-16 ***
hr18                       269.441      5.736  46.976  < 2e-16 ***
hr19                       186.256      5.714  32.596  < 2e-16 ***
hr20                       125.549      5.704  22.012  < 2e-16 ***
hr21                        87.554      5.693  15.378  < 2e-16 ***
hr22                        59.123      5.689  10.392  < 2e-16 ***
hr23                        26.838      5.688   4.719 2.41e-06 ***
workingday                   1.270      1.784   0.711 0.476810    
temp                       157.209     10.261  15.321  < 2e-16 ***
weathersitcloudy/misty     -12.890      1.964  -6.562 5.60e-11 ***
weathersitlight rain/snow  -66.494      2.965 -22.425  < 2e-16 ***
weathersitheavy rain/snow -109.745     76.667  -1.431 0.152341    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared:  0.6745,    Adjusted R-squared:  0.6731 
F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16

In mod.lm, the first level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient estimates are provided for them: implicitly, their coefficient estimates are zero, and all other levels are measured relative to these baselines. For example, the Feb coefficient of 6.845 signifies that, holding all other variables constant, there are on average about 7 more riders in February than in January. Similarly there are about 16.5 more riders in March than in January.

A slightly different coding of the variables hr and mnth, is given as follows:

contrasts(Bikeshare$hr) = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)
mod.lm2 <- lm(bikers ~ mnth + hr + workingday + temp + weathersit ,data = Bikeshare)
summary(mod.lm2)

Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    data = Bikeshare)

Residuals:
    Min      1Q  Median      3Q     Max 
-299.00  -45.70   -6.23   41.08  425.29 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 73.5974     5.1322  14.340  < 2e-16 ***
mnth1                      -46.0871     4.0855 -11.281  < 2e-16 ***
mnth2                      -39.2419     3.5391 -11.088  < 2e-16 ***
mnth3                      -29.5357     3.1552  -9.361  < 2e-16 ***
mnth4                       -4.6622     2.7406  -1.701  0.08895 .  
mnth5                       26.4700     2.8508   9.285  < 2e-16 ***
mnth6                       21.7317     3.4651   6.272 3.75e-10 ***
mnth7                       -0.7626     3.9084  -0.195  0.84530    
mnth8                        7.1560     3.5347   2.024  0.04295 *  
mnth9                       20.5912     3.0456   6.761 1.46e-11 ***
mnth10                      29.7472     2.6995  11.019  < 2e-16 ***
mnth11                      14.2229     2.8604   4.972 6.74e-07 ***
hr1                        -96.1420     3.9554 -24.307  < 2e-16 ***
hr2                       -110.7213     3.9662 -27.916  < 2e-16 ***
hr3                       -117.7212     4.0165 -29.310  < 2e-16 ***
hr4                       -127.2828     4.0808 -31.191  < 2e-16 ***
hr5                       -133.0495     4.1168 -32.319  < 2e-16 ***
hr6                       -120.2775     4.0370 -29.794  < 2e-16 ***
hr7                        -75.5424     3.9916 -18.925  < 2e-16 ***
hr8                         23.9511     3.9686   6.035 1.65e-09 ***
hr9                        127.5199     3.9500  32.284  < 2e-16 ***
hr10                        24.4399     3.9360   6.209 5.57e-10 ***
hr11                       -12.3407     3.9361  -3.135  0.00172 ** 
hr12                         9.2814     3.9447   2.353  0.01865 *  
hr13                        41.1417     3.9571  10.397  < 2e-16 ***
hr14                        39.8939     3.9750  10.036  < 2e-16 ***
hr15                        30.4940     3.9910   7.641 2.39e-14 ***
hr16                        35.9445     3.9949   8.998  < 2e-16 ***
hr17                        82.3786     3.9883  20.655  < 2e-16 ***
hr18                       200.1249     3.9638  50.488  < 2e-16 ***
hr19                       173.2989     3.9561  43.806  < 2e-16 ***
hr20                        90.1138     3.9400  22.872  < 2e-16 ***
hr21                        29.4071     3.9362   7.471 8.74e-14 ***
hr22                        -8.5883     3.9332  -2.184  0.02902 *  
hr23                       -37.0194     3.9344  -9.409  < 2e-16 ***
workingday                   1.2696     1.7845   0.711  0.47681    
temp                       157.2094    10.2612  15.321  < 2e-16 ***
weathersitcloudy/misty     -12.8903     1.9643  -6.562 5.60e-11 ***
weathersitlight rain/snow  -66.4944     2.9652 -22.425  < 2e-16 ***
weathersitheavy rain/snow -109.7446    76.6674  -1.431  0.15234    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared:  0.6745,    Adjusted R-squared:  0.6731 
F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16

What is the difference between the two codings? In mod.lm2, a coefficient estimate is reported for all but the last level of hr and mnth. Importantly, in mod.lm2, the coefficient estimate for the last level of mnth is not zero: instead, it equals the negative of the sum of the coefficient estimates for all of the other levels. Similarly, in mod.lm2, the coefficient estimate for the last level of hr is the negative of the sum of the coefficient estimates for all of the other levels. This means that the coefficients of hr and mnth in mod.lm2 will always sum to zero, and can be interpreted as the difference from the mean level. For example, the coefficient for January of −46.087 indicates that, holding all other variables constant, there are typically 46 fewer riders in January relative to the yearly average.

It is important to realize that the choice of coding really does not matter, provided that we interpret the model output correctly in light of the coding used. For example, we see that the predictions from the linear model are the same regardless of coding:

sum(( predict(mod.lm) - predict(mod.lm2))^2)
[1] 1.586608e-18

The sum of squared differences is zero. We can also see this using the all.equal() function:

all.equal(predict(mod.lm), predict(mod.lm2))
[1] TRUE

The left panel of the figure below displays the coefficients connected with each month of the year from the least squares linear regression model, ‘mod.lm2’. The coefficients for January through November can be directly obtained from the ‘mod.lm2’ object. However, the coefficient for December must be calculated as the negative sum of all the other months. On the other hand, the right panel exhibits the coefficients related to the hour of the day. It indicates that bike usage is highest during peak commute times and lowest overnight. We first obtain the coefficient estimates associated with mnth and hr.

coef.months <- c(coef(mod.lm2)[2:12],-sum(coef(mod.lm2)[2:12]))
coef.hours <- c(coef(mod.lm2)[13:35] , -sum(coef(mod.lm2)[13:35]))

To make the plot, we manually label the x-axis with the names of the months.

par(mfrow = c(1, 2))
plot(coef.months , xlab = "Month", ylab = "Coefficient", xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"))

plot(coef.hours , xlab = "Hour", ylab = "Coefficient",col = "blue", pch = 19, type = "o")

Now, we consider instead fitting a Poisson regression model to the Bikeshare data. Very little changes, except that we now use the function glm() with the argument family = poisson to specify that we wish to fit a Poisson regression model:

mod.pois <- glm( bikers ~ mnth + hr + workingday + temp + weathersit ,data = Bikeshare , family = poisson)
summary(mod.pois)

Call:
glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    family = poisson, data = Bikeshare)

Coefficients:
                           Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                4.118245   0.006021  683.964  < 2e-16 ***
mnth1                     -0.670170   0.005907 -113.445  < 2e-16 ***
mnth2                     -0.444124   0.004860  -91.379  < 2e-16 ***
mnth3                     -0.293733   0.004144  -70.886  < 2e-16 ***
mnth4                      0.021523   0.003125    6.888 5.66e-12 ***
mnth5                      0.240471   0.002916   82.462  < 2e-16 ***
mnth6                      0.223235   0.003554   62.818  < 2e-16 ***
mnth7                      0.103617   0.004125   25.121  < 2e-16 ***
mnth8                      0.151171   0.003662   41.281  < 2e-16 ***
mnth9                      0.233493   0.003102   75.281  < 2e-16 ***
mnth10                     0.267573   0.002785   96.091  < 2e-16 ***
mnth11                     0.150264   0.003180   47.248  < 2e-16 ***
hr1                       -0.754386   0.007879  -95.744  < 2e-16 ***
hr2                       -1.225979   0.009953 -123.173  < 2e-16 ***
hr3                       -1.563147   0.011869 -131.702  < 2e-16 ***
hr4                       -2.198304   0.016424 -133.846  < 2e-16 ***
hr5                       -2.830484   0.022538 -125.586  < 2e-16 ***
hr6                       -1.814657   0.013464 -134.775  < 2e-16 ***
hr7                       -0.429888   0.006896  -62.341  < 2e-16 ***
hr8                        0.575181   0.004406  130.544  < 2e-16 ***
hr9                        1.076927   0.003563  302.220  < 2e-16 ***
hr10                       0.581769   0.004286  135.727  < 2e-16 ***
hr11                       0.336852   0.004720   71.372  < 2e-16 ***
hr12                       0.494121   0.004392  112.494  < 2e-16 ***
hr13                       0.679642   0.004069  167.040  < 2e-16 ***
hr14                       0.673565   0.004089  164.722  < 2e-16 ***
hr15                       0.624910   0.004178  149.570  < 2e-16 ***
hr16                       0.653763   0.004132  158.205  < 2e-16 ***
hr17                       0.874301   0.003784  231.040  < 2e-16 ***
hr18                       1.294635   0.003254  397.848  < 2e-16 ***
hr19                       1.212281   0.003321  365.084  < 2e-16 ***
hr20                       0.914022   0.003700  247.065  < 2e-16 ***
hr21                       0.616201   0.004191  147.045  < 2e-16 ***
hr22                       0.364181   0.004659   78.173  < 2e-16 ***
hr23                       0.117493   0.005225   22.488  < 2e-16 ***
workingday                 0.014665   0.001955    7.502 6.27e-14 ***
temp                       0.785292   0.011475   68.434  < 2e-16 ***
weathersitcloudy/misty    -0.075231   0.002179  -34.528  < 2e-16 ***
weathersitlight rain/snow -0.575800   0.004058 -141.905  < 2e-16 ***
weathersitheavy rain/snow -0.926287   0.166782   -5.554 2.79e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  228041  on 8605  degrees of freedom
AIC: 281159

Number of Fisher Scoring iterations: 5

We can plot the coefficients associated with mnth and hr

par(mfrow = c(1, 2))
coef.mnth <- c(coef(mod.pois)[2:12] ,-sum(coef(mod.pois)[2:12]))
plot(coef.mnth , xlab = "Month", ylab = "Coefficient",xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M","J", "J", "A", "S", "O", "N", "D"))
coef.hours <- c(coef(mod.pois)[13:35] ,-sum(coef(mod.pois)[13:35]))
plot(coef.hours , xlab = "Hour", ylab = "Coefficient",col = "blue", pch = 19, type = "o")

We can once again use the predict() function to obtain the fitted values (predictions) from this Poisson regression model. However, we must use the argument type = "response" to specify that we want R to output \(\exp \left(\hat{\beta}_0+\right.\) \(\left.\hat{\beta}_1 X_1+\ldots+\hat{\beta}_p X_p\right)\) rather than \(\hat{\beta}_0+\hat{\beta}_1 X_1+\ldots+\hat{\beta}_p X_p\), which it will output by default.

plot(predict(mod.lm2), predict(mod.pois , type = "response"))
abline (0, 1, col = 2, lwd = 3)

The predictions from the Poisson regression model are correlated with those from the linear model; however, the former are non-negative. As a result the Poisson regression predictions tend to be larger than those from the linear model for either very low or very high levels of ridership.

In this section, we used the glm() function with the argument family = poisson in order to perform Poisson regression. Earlier in this lab we used the glm() function with family = binomial to perform logistic regression. Other choices for the family argument can be used to fit other types of GLMs. For instance, family = Gamma fits a gamma regression model.

4 Cross-Validation

4.1 The Validation Set Approach

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the Auto data set.

Before we begin, we use the set.seed() function in order to set a seed for seed R’s random number generator, so that the reader will obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.

We begin by using the sample() function to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set.

rm(Auto) #Remove Imported Auto data to use Auto Data from ISLR
library(ISLR)
set.seed(1)
train <- sample(392, 196)

(Here we use a shortcut in the sample command; see ?sample for details.) We then use the subset option in lm() to fit a linear regression using only the observations corresponding to the training set.

lm.fitA <- lm(mpg ~ horsepower, data = Auto, subset = train)
trainSET <- Auto[train, ]
validSET <- Auto[-train, ]
lm.fitB <- lm(mpg ~ horsepower, data = trainSET)
summary(lm.fitA)

Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3177 -3.5428 -0.5591  2.3910 14.6836 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.283548   1.044352   39.53   <2e-16 ***
horsepower  -0.169659   0.009556  -17.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.032 on 194 degrees of freedom
Multiple R-squared:  0.619, Adjusted R-squared:  0.6171 
F-statistic: 315.2 on 1 and 194 DF,  p-value: < 2.2e-16
summary(lm.fitB)

Call:
lm(formula = mpg ~ horsepower, data = trainSET)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3177 -3.5428 -0.5591  2.3910 14.6836 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.283548   1.044352   39.53   <2e-16 ***
horsepower  -0.169659   0.009556  -17.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.032 on 194 degrees of freedom
Multiple R-squared:  0.619, Adjusted R-squared:  0.6171 
F-statistic: 315.2 on 1 and 194 DF,  p-value: < 2.2e-16

4.1.1 Using caret to accomplish the same thing

library(caret)
set.seed(3)
trainIndex <- createDataPartition(y = Auto$mpg,
                                  p = 0.5,
                                  list = FALSE,
                                  times = 1)
training <- Auto[trainIndex, ]
testing <- Auto[-trainIndex, ]
dim(training)
[1] 198   9
dim(testing)
[1] 194   9
# 
myControl1 <- trainControl(method = "none")
myControl2 <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5)
#
mod_lm <- train(mpg ~ horsepower,
                data = training,
                trControl = myControl1,
                method = "lm")
#
mod_lm
Linear Regression 

198 samples
  1 predictor

No pre-processing
Resampling: None 
summary(mod_lm)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4938 -3.7061 -0.4894  2.9283 14.9663 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.06917    1.03089   38.87   <2e-16 ***
horsepower  -0.15575    0.00909  -17.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.086 on 196 degrees of freedom
Multiple R-squared:  0.5997,    Adjusted R-squared:  0.5976 
F-statistic: 293.6 on 1 and 196 DF,  p-value: < 2.2e-16

We now use the predict() function to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.

EMSE <- mean((Auto$mpg - predict(lm.fitA, newdata = Auto))[-train]^2)
EMSE
[1] 23.26601
# Or
EMSE <- mean((validSET$mpg - predict(lm.fitA, newdata = validSET))^2)
EMSE
[1] 23.26601

Use the RMSE() function from caret to compute the \(MSE_{testing}\).

# Using caret
RMSE(predict(mod_lm, testing), testing$mpg)^2
[1] 22.50887

Therefore, the estimated test MSE for the linear regression fit is 23.2660086. We can use the poly() function to estimate the test error for the polynomial and cubic regressions.

lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2 <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2
[1] 18.71646
# Or
EMSE2 <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2
[1] 18.71646

Use the RMSE() function from caret to compute the \(MSE_{testing}\) for a second order polynomial model.

# Using caret
mod_lm2  <- train(mpg ~ poly(horsepower, 2),
                  data = training,
                  trControl = myControl1,
                  method = "lm")
RMSE(predict(mod_lm2, testing), testing$mpg)^2
[1] 19.592

lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3 <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3
[1] 18.79401
# Or
EMSE3 <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3
[1] 18.79401


Use the RMSE() function from caret to compute the \(MSE_{testing}\) for a third order polynomial model.

# using caret---Your CODE HERE
mod_lm3  <- train(mpg ~ poly(horsepower, 3),
                  data = training,
                  trControl = myControl1,
                  method = "lm")
RMSE(predict(mod_lm3, testing), testing$mpg)^2
[1] 19.54432


These error rates are 18.7164595 and 18.7940068, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.

set.seed(231)
train <- sample(392, 196)
trainSET <- Auto[train, ]
validSET <- Auto[-train, ]
lm.fit1 <- lm(mpg ~ horsepower, data = Auto, subset = train)
EMSE1a <- mean((Auto$mpg - predict(lm.fit1, newdata = Auto))[-train]^2)
EMSE1a
[1] 25.14997
# Or
EMSE1a <- mean((validSET$mpg - predict(lm.fit1, newdata = validSET))^2)
EMSE1a
[1] 25.14997
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2a <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2a
[1] 21.82773
# Or
EMSE2a <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2a
[1] 21.82773
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3a <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3a
[1] 21.91503
# Or
EMSE3a <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3a
[1] 21.91503

Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are 25.1499689, 21.8277328, and 21.9150317, respectively.

These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.

4.2 Leave-One-Out Cross-Validation

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. If we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,

glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

and

lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function because the latter can be used together with cv.glm(). The cv.glm() function is part of the boot package.

library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err = cv.glm(data = Auto, glmfit = glm.fit)
cv.err$delta[1]
[1] 24.23151


Use the method = "LOOCV" from caret to estimate the leave one out cross validation estimate of the test error for a model that regresses mpg onto horsepower.

# Using caret
myControl3 <- trainControl(method = "LOOCV")
mod_CV  <- train(mpg ~ horsepower,
                data = Auto,
                trControl = myControl3,
                method = "lm")
mod_CV$results
  intercept     RMSE  Rsquared      MAE
1      TRUE 4.922552 0.6012358 3.848748
mod_CV$results$RMSE^2
[1] 24.23151


The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this cv.glm() case the numbers are identical (up to two decimal places) and correspond to the LOOCV statistic. Below, we discuss a situation in which the two numbers differ. Our cross-validation estimate for the test error is approximately 24.2315135.

We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we use the for() function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the i\(^{\text{th}}\) element of the vector cv.error. We begin by initializing the vector. This command will likely take a couple of minutes to run.

cv.error <- numeric(5)
for(i in 1:5){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(data = Auto, glmfit = glm.fit)$delta[1]
}
cv.error
[1] 24.23151 19.24821 19.33498 19.42443 19.03321

Use the method = "LOOCV" from caret to estimate the leave one out cross validation estimate of the test error for increasingly complex polynomial fits of order i = 1 to i = 5 that regress mpg onto horsepower.

# Using caret
degree <- 1:5
CV <- numeric(5)
for(d in degree){
f <- bquote(mpg ~ poly(horsepower, .(d)))
mod_CV  <- train(as.formula(f),
                  data = Auto,
                  trControl = myControl3,
                  method = "lm")
CV[d] <- mod_CV$results$RMSE^2
}
CV
[1] 24.23151 19.24821 19.33498 19.42443 19.03321

4.2.1 k-Fold Cross-Validation

The cv.glm() function can also be used to implement \(k\)-fold CV. Below we use k = 10, a common choice for \(k\), on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.

set.seed(17)
cv.error.10 <- numeric(10)
for(i in 1:10){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
}
cv.error.10
 [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
 [9] 18.87013 20.95520


Use the method = "cv" with number = 10 for K= 10 folds from caret to estimate the 10 fold cross validation estimate of the test error for increasingly complex polynomial fits of order i = 1 to i = 10 that regress mpg onto horsepower.

# using caret---Your CODE HERE
set.seed(172)
myControl4 <- trainControl(method = "cv",
                           number = 10)
degree <- 1:10
CV <- numeric(10)
for(d in degree){
f <- bquote(mpg ~ poly(horsepower, .(d)))
mod_CV  <- train(as.formula(f),
                  data = Auto,
                  trControl = myControl4,
                  method = "lm")
CV[d] <- mod_CV$results$RMSE^2
}
CV
 [1] 23.91805 18.80915 19.23979 18.78304 18.87683 18.56381 18.64721 18.71842
 [9] 18.76923 18.83391


Notice that the computation time is much shorter than that of LOOCV. (In principle, the computation time for LOOCV for a least squares linear model should be faster than for \(k\)-fold CV, due to the availability of \[CV_{(n)} = \frac{1}{n}\sum_{i = 1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,\] for LOOCV; however, unfortunately the cv.glm() function does not make use of this formula.) We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.

We saw in the last section that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform \(k\)-fold CV, then the two numbers associated with delta differ slightly. The first is the standard \(k\)-fold CV estimate, as in \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^kMSE_i.\] The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.

4.3 10-Fold Cross-Validation

Compute 10-fold Cross Validation errors for mod.be, mod.fs, mod1, mod2, and mod3 from Project 1.

site <- "http://ww2.amstat.org/publications/jse/datasets/homes76.dat.txt"
HP <- read.table(file = site, header = TRUE)
# Removing indicated columns
HP <- HP[ ,-c(1, 7, 10, 15, 16, 17, 18, 19)]
# Renaming columns
names(HP) <- c("price", "size", "lot", "bath", "bed", "year", "age", 
               "garage", "status", "active", "elem")
mod.be <- glm(price ~ size + lot + bed + status + elem, data = HP)
mod.fs <- glm(price ~ elem + garage + lot + active + size + bed, data = HP)
mod1 <- glm(price ~ . - status - year, data = HP)
mod2 <- update(mod1, .~. + bath:bed + I(age^2))
mod3 <- glm(price ~ size + lot + bath + bed + bath:bed + age + I(age^2) + 
              garage + active + I(elem == 'edison') + I(elem == 'harris'), 
            data = HP)
library(boot)
set.seed(461)
CV10mod.be <- cv.glm(data = HP, glmfit = mod.be, K = 10)$delta[1]
CV10mod.be
[1] 2372.26
CV10mod.fs <- cv.glm(data = HP, glmfit = mod.fs, K = 10)$delta[1]
CV10mod.fs
[1] 2416.15
CV10mod1 <- cv.glm(data = HP, glmfit = mod1, K = 10)$delta[1]
CV10mod1
[1] 2844.668
CV10mod2 <- cv.glm(data = HP, glmfit = mod2, K = 10)$delta[1]
CV10mod2
[1] 2639.653
CV10mod3 <- cv.glm(data = HP, glmfit = mod3, K = 10)$delta[1]
CV10mod3
[1] 2297.885


Use the method = cv" with number = 10 from caret to estimate the ten fold cross validation estimate of the test error for the five models from Project 1.

# Using caret
set.seed(461)
myControl5 <- trainControl(method = "cv",
                           number = 10)
mod_BE <- train(price ~ size + lot + bed + status + elem, 
                data = HP,
                trControl =myControl5,
                method = "lm")
mod_BE$results$RMSE^2
[1] 2142.238
### Below is what the others should have
mod_FS <- train(price ~ elem + garage + lot + active + size + bed, 
                data = HP,
                trControl =myControl5,
                method = "lm")
mod_FS$results$RMSE^2
[1] 2050.674
mod_MOD1 <- train(price ~ . - status - year, 
                  data = HP,
                  trControl =myControl5,
                  method = "lm")
mod_MOD1$results$RMSE^2
[1] 2492.183
mod_MOD2 <- train(price ~ . - status - year + bath:bed + I(age^2), 
                  data = HP,
                  trControl =myControl5,
                  method = "lm")
mod_MOD2$results$RMSE^2
[1] 2480.545
mod_MOD3 <- train(price ~ size + lot + bath + bed + bath:bed + age + I(age^2) + 
              garage + active + I(elem == 'edison') + I(elem == 'harris'), 
                  data = HP,
                  trControl =myControl5,
                  method = "lm")
mod_MOD3$results$RMSE^2
[1] 2164.437


Complete the table below using inline R code to report your answers.

Table of Results

Model MSE from cv.glm() MSE from caret
Backward elimination 2372.26 2142.24
Forward selection 2416.15 2050.67
Model 1 (mod1) 2844.67 2492.18
Model 2 (mod2) 2639.65 2480.55
Model 3 (mod3) 2297.88 2164.44

This document uses DT by Xie, Cheng, and Tan (2024), ggplot2 by Wickham et al. (2024), ISLR by James et al. (2021), ISLR2 by James et al. (2022),,MASS by Ripley (2024), plotly by Sievert et al. (2024), rmarkdown by Allaire et al. (2024), dplyr by Wickham et al. (2023), knitr by Xie (2024b), class by Ripley (2023), and bookdown by Xie (2024a).

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bookdown_0.40  rmarkdown_2.27 boot_1.3-30    caret_6.0-94   lattice_0.22-6
 [6] class_7.3-22   ISLR2_1.3-2    broom_1.0.6    car_3.1-2      carData_3.0-5 
[11] ISLR_1.4       MASS_7.3-60.2  dplyr_1.1.4    plotly_4.10.4  ggvis_0.4.9   
[16] ggplot2_3.5.1  DT_0.33        knitr_1.48    

loaded via a namespace (and not attached):
 [1] pROC_1.18.5          rlang_1.1.4          magrittr_2.0.3      
 [4] compiler_4.4.0       mgcv_1.9-1           vctrs_0.6.5         
 [7] reshape2_1.4.4       stringr_1.5.1        pkgconfig_2.0.3     
[10] fastmap_1.2.0        backports_1.5.0      labeling_0.4.3      
[13] utf8_1.2.4           promises_1.3.0       prodlim_2024.06.25  
[16] purrr_1.0.2          xfun_0.45            cachem_1.1.0        
[19] jsonlite_1.8.8       recipes_1.1.0        highr_0.11          
[22] later_1.3.2          jpeg_0.1-10          parallel_4.4.0      
[25] R6_2.5.1             bslib_0.7.0          stringi_1.8.4       
[28] parallelly_1.37.1    rpart_4.1.23         lubridate_1.9.3     
[31] jquerylib_0.1.4      Rcpp_1.0.12          assertthat_0.2.1    
[34] iterators_1.0.14     future.apply_1.11.2  httpuv_1.6.15       
[37] Matrix_1.7-0         splines_4.4.0        nnet_7.3-19         
[40] timechange_0.3.0     tidyselect_1.2.1     rstudioapi_0.16.0   
[43] abind_1.4-5          yaml_2.3.8           timeDate_4032.109   
[46] codetools_0.2-20     listenv_0.9.1        tibble_3.2.1        
[49] plyr_1.8.9           shiny_1.9.1          withr_3.0.0         
[52] evaluate_0.24.0      future_1.33.2        survival_3.5-8      
[55] pillar_1.9.0         foreach_1.5.2        stats4_4.4.0        
[58] generics_0.1.3       munsell_0.5.1        scales_1.3.0        
[61] globals_0.16.3       xtable_1.8-4         glue_1.7.0          
[64] lazyeval_0.2.2       tools_4.4.0          data.table_1.15.4   
[67] ModelMetrics_1.2.2.2 gower_1.0.1          grid_4.4.0          
[70] tidyr_1.3.1          crosstalk_1.2.1      ipred_0.9-14        
[73] colorspace_2.1-0     nlme_3.1-164         cli_3.6.2           
[76] fansi_1.0.6          viridisLite_0.4.2    lava_1.8.0          
[79] gtable_0.3.5         sass_0.4.9           digest_0.6.35       
[82] htmlwidgets_1.6.4    farver_2.1.2         htmltools_0.5.8.1   
[85] lifecycle_1.0.4      hardhat_1.4.0        httr_1.4.7          
[88] mime_0.12           

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://www.statlearning.com.
———. 2022. ISLR2: Introduction to Statistical Learning, Second Edition. https://www.statlearning.com.
Ripley, Brian. 2023. Class: Functions for Classification. http://www.stats.ox.ac.uk/pub/MASS4/.
———. 2024. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. http://www.stats.ox.ac.uk/pub/MASS4/.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2024. Plotly: Create Interactive Web Graphics via Plotly.js. https://plotly-r.com.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Xie, Yihui. 2024a. Bookdown: Authoring Books and Technical Documents with r Markdown. https://github.com/rstudio/bookdown.
———. 2024b. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2024. DT: A Wrapper of the JavaScript Library DataTables. https://github.com/rstudio/DT.

  1. The material in Section 1 is modified from the Chapter 2 lab of An introduction to Statistical Learning.↩︎

  2. Use the argument dpi = 96 inside include_graphics().↩︎